CS 838 – Chip Multiprocessor Prefetching

Authors

Kyle Nesbit

Nick Lindberg

Abstract

1. Introduction

Over the past two decades, advances in semiconductor process technology and microarchitecture have led to significant reductions in processor clock periods. Meanwhile, advances in memory technology have led to ever-increasing memory densities, but relatively minor reductions in memory access time. Consequently, memory latencies measured in processor clock cycles are continually increasing and are now on the order of hundreds of clock cycles. Cache memories help bridge the processor-memory latency gap, but caches are not always effective. Cache misses to main memory still occur, and when they do, the penalty is very high.

Probably the most basic technique for enhancing cache performance is prefetching. As the processor-memory latency gap continues to grow, there is a need for continued development and refinement of prefetch methods. Most existing prefetching research has focused on uniprocessor prefetching. In this paper, we investigate cache prefetching aimed specifically at a Chip Multiprocessor (CMP).

Prefetching in a CMP system has very different constraints than uniprocessor prefetching. In a CMP, pin bandwidth and the number of transaction buffer entries (TBEs, the maximum number of outstanding memory requests) are much more important. Multiple processors compete for off-chip bandwidth and TBEs, which reduces the system's tolerance to inaccurate prefetches, where prefetch accuracy is the percentage of prefetches that are accessed by demand fetches before they are evicted from the cache. Inaccurate prefetches waste system resources, increase bus contention, and can degrade overall system performance. Furthermore, in a directory-based system with multiple CMPs, memory latency is extremely important. Often these systems store the directory in memory, so a request may need an extra memory access to retrieve the directory entry, which effectively doubles the latency of the request (assuming memory access time is much larger than the bus transaction time [5]).

The CMP prefetching method we study is based on "stride stream buffer prefetching concentration zones" (CZones) [14]. This method, as originally proposed, divides memory into fixed-size zones and looks for stride patterns in sequences of cache misses directed toward the individual zones. When it finds a stride pattern, it launches prefetch requests. This method has the desirable property of not needing the program counter values of the load instructions that cause misses, which may not be readily available at lower levels of the memory hierarchy. Throughout the rest of this paper we argue for using CZone prefetching in a CMP system. In section 2, …
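The CZone idea described above (partition addresses into fixed-size zones, watch the misses that fall in each zone, and prefetch once a constant stride appears) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name CZonePrefetcher, the 64 KiB zone size, the three-miss detection window, the prefetch degree of 2, and the choice to stop prefetching at the zone boundary are all assumptions made only for illustration.

```python
from collections import defaultdict, deque


class CZonePrefetcher:
    """Toy model of zone-based stride detection (all parameters hypothetical)."""

    def __init__(self, zone_bits=16, degree=2):
        self.zone_bits = zone_bits  # assumed zone size: 2**16 bytes = 64 KiB
        self.degree = degree        # assumed number of prefetches per detection
        # Per-zone history of the three most recent miss addresses.
        self.history = defaultdict(lambda: deque(maxlen=3))

    def on_miss(self, addr):
        """Record a miss address and return the addresses to prefetch, if any."""
        zone = addr >> self.zone_bits  # zone tag = high-order address bits
        misses = self.history[zone]
        misses.append(addr)
        if len(misses) < 3:
            return []  # not enough history in this zone yet

        a, b, c = misses
        stride = b - a
        # A pattern is detected when two consecutive deltas match and are nonzero.
        if stride == 0 or c - b != stride:
            return []

        prefetches = []
        for i in range(1, self.degree + 1):
            target = c + i * stride
            # Assumption: do not follow the stride outside the current zone.
            if target >> self.zone_bits != zone:
                break
            prefetches.append(target)
        return prefetches


if __name__ == "__main__":
    pf = CZonePrefetcher()
    # Misses from two zones are interleaved; only miss addresses are used,
    # no program counter values.
    for miss in (0x10000, 0x50000, 0x10100, 0x50008, 0x10200):
        print(hex(miss), [hex(p) for p in pf.on_miss(miss)])
```

Because the history is keyed by zone rather than by load instruction, interleaved miss streams to different zones do not disturb each other's stride detection, which is the property the text highlights for lower levels of the memory hierarchy.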


Related resources

Off-loading application controlled data prefetching in numerical codes for multi-core processors

An important issue when designing numerical code in High Performance Computing is cache optimization in order to exploit the performance potential of a given target architecture. This includes techniques to improve memory access locality as well as prefetching. Inherent algorithm constraints often limit the first approach, which typically uses a blocking technique. While there exist automatic pr...

Adaptive Sequential Prefetching in Multiprocessor Systems Using M5

With the increase in popularity and performance of multiprocessor systems, the cost of memory latency between systems poses a growing issue to developers. Prefetching can address this problem by issuing multiple block requests on a single read miss, bringing in blocks in anticipation of use later. This paper presents an implementation of adaptive sequential prefetching using the M5 simulator.

Design of the HP PA 7200 CPU

The PA 7200 incorporates a number of enhancements specifically designed for a glueless four-way multiprocessor system with increased performance on both technical and commercial applications [10-11]. On the chip is a multiprocessor system bus interface which connects directly to the Runway bus described in Article 2. The PA 7200 also has a new data cache organization, a prefetching mechanism, and ...

Exploiting the Potential of a Network of IRAMs

Recently, a great deal of research has gone into reducing the gap in performance between processors and their memory systems. Techniques such as prefetching have been developed in order to hide the long latencies involved in retrieving data from off-chip DRAM. However, applications with irregular access patterns generally see greatly reduced benefit from these techniques, and latencies are becomi...

Joint Exploration of Hardware Prefetching and Bandwidth Partitioning in Chip Multiprocessors

In this paper, we propose an analytical model-based study to investigate how hardware prefetching and memory bandwidth partitioning impact Chip Multi-Processors (CMP) system performance and how they interact. The model includes a composite prefetching metric that can help determine under which conditions prefetching can improve system performance, a bandwidth partitioning model that takes into ...



Publication date: 2003
